HTML API: Serialize decoded carriage returns as character references by sirreal · Pull Request #42 · sirreal/wordpress-develop

sirreal · 2026-06-10T09:00:19Z

Note: this PR was rebuilt on top of the spec-compliant getters branch
(spec-compliant-getters / #53) and trimmed. It must not land before that branch:
the CR escaping below is only correct over preprocessing-correct getters —
otherwise a raw source CR would be wrongly preserved as a character reference where
browsers normalize it to a line feed. The previous revision of this branch
carried read-path workarounds for that problem; they are now obsolete.

Problem

The serializer emits decoded carriage returns raw into text and attribute
values. The HTML parser's input preprocessing turns a raw CR into a line feed
on the next parse, so normalize() output never reaches a fixed point for
documents containing &#13;:

normalize( '<p>a&#13;b</p>' )  → "<p>a\rb</p>"
normalize( "<p>a\rb</p>" )     → "<p>a\nb</p>"   // text changed again
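
The instability above can be reproduced outside the HTML API with a minimal sketch. `sketch_normalize_newlines()` is a hypothetical helper, not part of WordPress; it models only the newline-normalization step of input preprocessing and assumes nothing else about the parser.

```php
<?php
// Hypothetical helper: models only the newline normalization performed by
// HTML input preprocessing (CRLF and lone CR both become LF).
function sketch_normalize_newlines( string $html ): string {
	return str_replace( array( "\r\n", "\r" ), "\n", $html );
}

// What a serializer produces if it emits a decoded CR raw.
$first_pass = "<p>a\rb</p>";

// Parsing that output again applies preprocessing, so the text changes.
$second_pass = sketch_normalize_newlines( $first_pass );
var_dump( $first_pass === $second_pass ); // bool(false)

// A character reference contains no raw CR, so preprocessing leaves it alone.
$escaped = '<p>a&#13;b</p>';
var_dump( sketch_normalize_newlines( $escaped ) === $escaped ); // bool(true)
```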

Changes

serialize_decoded_text(): escapes with htmlspecialchars(), then
emits CR as a character reference and U+0000 as U+FFFD. Used for #text nodes, attribute
values, and RCDATA-ish element contents (TITLE/TEXTAREA closers).
SCRIPT/STYLE contents remain raw, and comments are untouched — character
references do not decode in those contexts.

This is a deliberate, disclosed deviation from browser innerHTML, which
emits the raw CR/NUL and loses them on reparse. Emitting the references
keeps serialized output idempotent and preserves the decoded values exactly
through parse/serialize round trips.

Serializer NULL handling consolidated: with the getters now exposing
tag and attribute names with NULL bytes already replaced (#53), the serializer's name-scrubbing band-aids were dead code
and are removed. The only live NULL source — API-supplied attribute
values — is handled in serialize_decoded_text(). UTF-8 scrubbing and the
seen-names dedupe remain (still live for invalid UTF-8 in source names).

Dropped from the previous revision: get_attribute_for_serialization()
(serialization reads through the now-correct get_attribute()) and the
class_name_updates_to_attributes_updates() fix (absorbed into the getters
branch).
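
As an illustration of the escaping order described for serialize_decoded_text(), here is a minimal standalone sketch. The function name `sketch_serialize_decoded_text()`, the `htmlspecialchars()` flags, and the `&#13;` spelling of the reference are assumptions made for this example; the branch itself may differ on each of them.

```php
<?php
// Hypothetical stand-in for the serializer helper described above.
function sketch_serialize_decoded_text( string $decoded ): string {
	// Escape syntax characters first, so the "&" of the reference added
	// below is not escaped a second time.
	$escaped = htmlspecialchars( $decoded, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8' );

	// Then rewrite the two bytes that the next parse would transform.
	// Assumption: CR is spelled "&#13;" in the output.
	return str_replace(
		array( "\r", "\0" ),
		array( '&#13;', "\u{FFFD}" ),
		$escaped
	);
}

echo sketch_serialize_decoded_text( "a\rb & c" ), "\n"; // a&#13;b &amp; c
```

Running the steps in the opposite order would turn the reference into `&amp;#13;`, which decodes to literal text rather than a carriage return.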

Results

Adversarial-review fuzzing over a 4,000-document hostile corpus: normalize()
double-pass non-idempotence drops from 1,359 documents on trunk to 0, with
zero behavioral regressions outside the intended character-reference emissions. Remaining
instability on a second structured corpus is a proven strict subset of trunk's
pre-existing issues (XMP escaping, foreign-content TEXTAREA), tracked
separately on #65372.

Testing

Regression tests landed red-first: decoded-CR cases across text, RCDATA,
attributes, tables, and templates, each asserting idempotence; raw-CR cases
pin the getter behavior this branch depends on; reparse assertions pin the
round-trip contract; rawtext (SCRIPT/STYLE) contents pinned unescaped. Full
html-api and html-api-html5lib-tests groups pass.

(Trac ticket 65372)

github-actions · 2026-06-10T09:29:10Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props wildworks, dmsnell, sergeybiryukov, desrosj, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Red TDD step: browser-verified expectations for raw CR/CRLF/NUL in attribute values; passing pins for encoded character references and for verbatim pass-through of API-supplied values. See #65372.

Attribute values read from the input document now normalize newlines (CRLF/CR to LF) and replace U+0000 NULL bytes with U+FFFD before decoding character references, matching what browsers produce for the same markup. Values enqueued through set_attribute() are plaintext API values and continue to pass through unchanged. See #65372.

Red TDD step: flushing add_class()/remove_class() updates must read the existing class attribute through the same input preprocessing as get_attribute(), normalizing newlines and replacing NULL bytes. See #65372.

class_name_updates_to_attributes_updates() reads the existing class value through the same preprocessing helper as get_attribute(), so add_class()/remove_class() no longer rebuild the attribute from raw source bytes containing CR or NULL. See #65372.

Red TDD step: browser-verified expectations that attribute names are exposed and addressed with U+FFFD replacing NULL bytes, that names collapsing after replacement behave as duplicates of one attribute, and that attribute updates target the replaced name. See #65372.

Attribute lookup keys are normalized where they are created, in parse_next_attribute(): NULL bytes are replaced with U+FFFD before lowercasing, as the tokenizer does in browsers. Names which collapse to the same replaced name are duplicates of one attribute (first one wins), lookups by the raw NULL spelling no longer match, and updates or removals by the replaced name target the source attribute. Raw document spans are untouched. See #65372.
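
The key normalization and the first-one-wins rule can be sketched as follows. Both functions are hypothetical and operate on plain arrays of name/value pairs; they are not the code in parse_next_attribute().

```php
<?php
// Hypothetical: build a lookup key the way the commit describes, replacing
// NULL bytes with U+FFFD and then lowercasing ASCII letters only.
function sketch_attribute_lookup_key( string $raw_name ): string {
	$replaced = str_replace( "\0", "\u{FFFD}", $raw_name );
	return strtr( $replaced, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz' );
}

// Hypothetical: collect attributes in source order; a later attribute whose
// key collapses onto an earlier one is a duplicate and is ignored.
function sketch_collect_attributes( array $raw_pairs ): array {
	$attributes = array();
	foreach ( $raw_pairs as list( $raw_name, $value ) ) {
		$key = sketch_attribute_lookup_key( $raw_name );
		if ( ! array_key_exists( $key, $attributes ) ) {
			$attributes[ $key ] = $value; // First one wins.
		}
	}
	return $attributes;
}

$attributes = sketch_collect_attributes(
	array(
		array( "DATA\0X", 'first' ),
		array( "data\u{FFFD}x", 'second' ),
	)
);
var_dump( count( $attributes ) );               // int(1)
var_dump( isset( $attributes["data\0x"] ) );    // bool(false)
```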

Red TDD step: tag names are exposed with U+FFFD replacing NULL bytes; passing pins confirm NULL bytes never select rawtext parsing and never appear in PI-lookalike comment tag names. See #65372.

get_tag() (and get_token_name(), which delegates to it) returns tag names with U+0000 NULL bytes replaced by U+FFFD, as the tokenizer does in browsers. Internal token identification continues to compare raw bytes: a NULL byte in a tag name already prevents rawtext detection, matching browsers, where the replaced name likewise never equals SCRIPT or the other special names. See #65372.

Red TDD step: browser-verified expectation that classList-equivalent reads preserve NULL bytes in values set through the API; the U+0000 replacement belongs to the tokenizer, and document-sourced values already receive it in get_attribute(). See #65372.

class_list() received its NULL-byte replacement when reading raw class values; that replacement now happens in get_attribute() for values from the input document. Performing it on API-supplied values diverged from browsers, where classList preserves NULL bytes in values set via setAttribute(). See #65372.

Benchmark-guided: reading an attribute value applies up to three str_replace() passes, which doubled the read cost for long values containing no bytes that need replacement. Guarding with strpos() keeps the common case at two fast scans, since values are typically free of CR and NULL. Benchmark (PHP 8.4, medians of 3): scanning 100-tag documents and reading 3 attributes each, 2000 iterations: trunk 667ms, unguarded 714ms, guarded 699ms. Reading a 10.8KB clean attribute value 200k times: trunk 147ms, unguarded 313ms, guarded 258ms. The remaining cost is the unavoidable byte inspection. See #65372.
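
A minimal sketch of the guard pattern, assuming a hypothetical helper `sketch_preprocess_value()`. It deliberately ignores character reference decoding and shows only how the strpos() checks skip the replacement passes for clean values.

```php
<?php
// Hypothetical helper: applies the replacements only when a cheap scan
// finds a byte that needs them. Clean values cost two strpos() scans.
function sketch_preprocess_value( string $raw ): string {
	if ( false !== strpos( $raw, "\r" ) ) {
		// CRLF must be handled before lone CR so it yields a single LF.
		$raw = str_replace( array( "\r\n", "\r" ), "\n", $raw );
	}

	if ( false !== strpos( $raw, "\0" ) ) {
		$raw = str_replace( "\0", "\u{FFFD}", $raw );
	}

	return $raw;
}

var_dump( sketch_preprocess_value( 'clean value' ) === 'clean value' ); // bool(true)
```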

Red TDD step from adversarial review: a named character reference without a terminating semicolon must decode when followed by a NULL byte or any non-ASCII byte. Replacing NULL with U+FFFD before decoding fed the decoder a multi-byte follower whose classification by ctype_alnum() depends on the process locale, suppressing valid decodes in attribute values, diverging from browsers and from trunk. See #65372.

The tokenizer replaces U+0000 NULL bytes as it consumes input, so a character reference without a terminating semicolon sees the raw NULL byte as its follower, which is unambiguous, and the reference decodes. Replacing before decoding handed the decoder U+FFFD's lead byte, whose ctype_alnum() classification depends on the process locale, wrongly suppressing the decode under UTF-8 locales. No character reference decodes into NULL, so replacing after decoding is equivalent for the value's own bytes and faithful to the tokenizer's order. See #65372.

Per the named-character-reference state, a semicolon-less reference is ambiguous only when followed by an ASCII alphanumeric or equals sign. ctype_alnum() classifies bytes 0x80 and above as alphanumeric under UTF-8 locales, wrongly suppressing decodes followed by any non-ASCII byte and making decoding depend on the process locale. See #65372.
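
The locale-independent check can be sketched with explicit byte ranges instead of ctype_alnum(). `sketch_is_ambiguous_follower()` is a hypothetical name, and the sketch covers only the attribute-value rule quoted above: the follower byte is ambiguous when it is an equals sign or an ASCII alphanumeric.

```php
<?php
// Hypothetical: is the byte at offset $at an ambiguous follower for a
// semicolon-less named character reference in an attribute value?
function sketch_is_ambiguous_follower( string $text, int $at ): bool {
	if ( $at >= strlen( $text ) ) {
		return false; // End of input is never ambiguous.
	}

	$code = ord( $text[ $at ] );

	return 0x3D === $code                        // "="
		|| ( $code >= 0x30 && $code <= 0x39 )    // 0-9
		|| ( $code >= 0x41 && $code <= 0x5A )    // A-Z
		|| ( $code >= 0x61 && $code <= 0x7A );   // a-z
}

// "&amp" ends at offset 4 in each of these inputs.
var_dump( sketch_is_ambiguous_follower( '&ampx', 4 ) );          // bool(true)
var_dump( sketch_is_ambiguous_follower( "&amp\0", 4 ) );         // bool(false)
var_dump( sketch_is_ambiguous_follower( "&amp\u{E9}", 4 ) );     // bool(false)
```

Because the comparison works on byte values, bytes 0x80 and above are never classified as alphanumeric, whatever the process locale.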

Red TDD step from adversarial review: next_tag() must match tag names in the same U+FFFD-replaced alphabet that get_tag() exposes, so the getter round-trips into queries, raw NULL spellings match nothing, and the Tag Processor agrees with the HTML Processor, whose queries already compare against the replaced token name. See #65372.

next_tag() compared sought tag names against raw document bytes while get_tag() returns names with NULL bytes replaced by U+FFFD, breaking the getter-to-query round trip and disagreeing with the HTML Processor's queries. Matching now happens in the exposed alphabet; the existing byte comparison is unchanged for names without NULL bytes, so the hot path costs the same. See #65372.
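
One way to picture matching in the exposed alphabet while keeping the plain byte comparison for ordinary names. `sketch_tag_matches()` is a hypothetical function over two strings, not the Tag Processor's implementation.

```php
<?php
// Hypothetical: compare a raw source tag name against a sought name in the
// alphabet that get_tag() exposes (NULL bytes replaced by U+FFFD).
function sketch_tag_matches( string $raw_source_name, string $sought ): bool {
	// Hot path: names without NULL bytes are compared as before.
	if ( false === strpos( $raw_source_name, "\0" ) ) {
		return 0 === strcasecmp( $raw_source_name, $sought );
	}

	// Rare path: replace first, then compare, so only the replaced
	// spelling can match.
	$exposed = str_replace( "\0", "\u{FFFD}", $raw_source_name );
	return 0 === strcasecmp( $exposed, $sought );
}

var_dump( sketch_tag_matches( 'div', 'DIV' ) );                 // bool(true)
var_dump( sketch_tag_matches( "di\0v", "DI\u{FFFD}V" ) );       // bool(true)
var_dump( sketch_tag_matches( "di\0v", "DI\0V" ) );             // bool(false)
```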

Red TDD step from adversarial review: get_attribute( 'CLASS' ) returned a stale value when class updates were pending, because the flush guard compared the attribute name case-sensitively. See #65372.

Attribute lookups are ASCII-case-insensitive, but the pending-class flush in get_attribute() compared the requested name case-sensitively, returning a stale value for spellings like "CLASS". See #65372.
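
The shape of the corrected guard, as a sketch: `sketch_is_class_attribute()` is a hypothetical predicate standing in for the comparison inside get_attribute(), and the use of strcasecmp() is an assumption about how the comparison could be written.

```php
<?php
// Hypothetical: decide whether a requested attribute name refers to "class",
// ignoring ASCII case, so pending class updates are flushed for any spelling.
function sketch_is_class_attribute( string $requested_name ): bool {
	return 0 === strcasecmp( 'class', $requested_name );
}

var_dump( sketch_is_class_attribute( 'CLASS' ) );     // bool(true)
var_dump( sketch_is_class_attribute( 'className' ) ); // bool(false)
```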

From adversarial review: pins for class helpers over replaced source values, boolean attributes with NULL-byte names, verbatim prefix matching in get_attribute_names_with_prefix(), and HTML Processor end-tag matching across NULL and U+FFFD spellings (browser-verified: both spellings tokenize to the same name). Documents the @since 7.1.0 behavior on indirectly-affected getters and the known asymmetry of set_modifiable_text(), whose value reads back normalized unlike attribute values, which round-trip verbatim. See #65372.

Red TDD step: decoded carriage returns in text and attribute values must serialize as character references so that normalized output is idempotent: a raw CR in serialized output would be normalized to a line feed when parsed again. The raw-CR attribute and class-update cases already pass through the preprocessing-correct getters and pin that behavior. See #65372.

The serializer emitted decoded carriage returns raw into text and attribute values, where input preprocessing turns them into line feeds on the next parse: normalized output never reached a fixed point for documents containing &#13;. Escaping CR after htmlspecialchars() keeps the character through parse/serialize round trips. Attribute values are read through get_attribute(), whose input preprocessing guarantees that raw source carriage returns already arrive normalized to line feeds, so only genuinely decoded CRs are escaped. See #65372.

An attribute value set through set_attribute() may contain NULL bytes; serializing them as U+FFFD keeps normalized output idempotent, where browsers' innerHTML emits the raw byte and loses it to replacement on the next parse. This pins the behavior ahead of consolidating the serializer's NULL handling. See #65372.

The getters now expose tag and attribute names with NULL bytes already replaced by U+FFFD, leaving the serializer's name scrubbing dead, and the only live input to the per-attribute whole-buffer scrub was an API-supplied attribute value. That replacement moves into serialize_decoded_text() next to the carriage-return escaping, which exists for the same reason: emitting bytes the next parse would transform. UTF-8 scrubbing of qualified names remains, as invalid sequences can still reach serialization through source names. See #65372.

From adversarial review: pins that SCRIPT and STYLE contents serialize without escaping, where character references do not decode, and that serialize_token() output for modified class and NULL-containing attribute values parses back to the same decoded values. See #65372.

sirreal marked this pull request as ready for review June 10, 2026 09:28

sirreal changed the base branch from trunk to html-api-fuzz-fiz/decoded-cr-base June 10, 2026 09:55

sirreal added a commit that referenced this pull request Jun 10, 2026

Merge PR #42: HTML API: Preserve decoded carriage returns in serializ…

6b8d48c

…ation # Conflicts: # src/wp-includes/html-api/class-wp-html-processor.php

sirreal added 24 commits June 11, 2026 18:29

HTML API: Add tests for attribute value input preprocessing.

cee0661

HTML API: Add tests for class updates over preprocessed values.

48d8fb4

HTML API: Add tests for NULL bytes in tag names.

135157f

HTML API: Add test for case-insensitive class update flushing.

5292c7d

HTML API: Flush class updates for any case spelling of "class".

8c26adf

sirreal mentioned this pull request Jun 11, 2026

HTML API: Apply input preprocessing consistently at Tag Processor read boundaries #53

Open

sirreal force-pushed the html-api-fuzz-fiz/decoded-cr branch from c86078c to 3a74497 Compare June 11, 2026 19:04

sirreal changed the title ~~HTML API: Preserve decoded carriage returns in serialization~~ HTML API: Serialize decoded carriage returns as character references Jun 11, 2026

sirreal changed the base branch from html-api-fuzz-fiz/decoded-cr-base to trunk June 11, 2026 19:07

sirreal added a commit that referenced this pull request Jun 11, 2026

Merge updated PR #42: decoded carriage return serialization

ce2af0e

# Conflicts: # src/wp-includes/html-api/class-wp-html-processor.php # tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php

sirreal wants to merge 24 commits into trunk from html-api-fuzz-fiz/decoded-cr
